Big data is a subset of multidimensional data. Both contribute (but neither is necessary) to the degree of story-worthiness of your dataset.
Some of the best and most useful data visualizations are ones that we make for ourselves.
January 2016
"The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey, 1977
library(NHANES)
data(NHANES)
set.seed(123) ## why set the seed?
str(NHANES)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10000 obs. of 76 variables:
## $ ID : int 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
## $ SurveyYr : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
## $ Age : int 34 34 34 4 49 9 8 45 45 45 ...
## $ AgeDecade : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
## $ AgeMonths : int 409 409 409 49 596 115 101 541 541 541 ...
## $ Race1 : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
## $ Race3 : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Education : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
## $ MaritalStatus : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
## $ HHIncome : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
## $ HHIncomeMid : int 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
## $ Poverty : num 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
## $ HomeRooms : int 6 6 6 9 5 6 7 6 6 6 ...
## $ HomeOwn : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
## $ Work : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
## $ Weight : num 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
## $ Length : num NA NA NA NA NA NA NA NA NA NA ...
## $ HeadCirc : num NA NA NA NA NA NA NA NA NA NA ...
## $ Height : num 165 165 165 105 168 ...
## $ BMI : num 32.2 32.2 32.2 15.3 30.6 ...
## $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BMI_WHO : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
## $ Pulse : int 70 70 70 NA 86 82 72 62 62 62 ...
## $ BPSysAve : int 113 113 113 NA 112 86 107 118 118 118 ...
## $ BPDiaAve : int 85 85 85 NA 75 47 37 64 64 64 ...
## $ BPSys1 : int 114 114 114 NA 118 84 114 106 106 106 ...
## $ BPDia1 : int 88 88 88 NA 82 50 46 62 62 62 ...
## $ BPSys2 : int 114 114 114 NA 108 84 108 118 118 118 ...
## $ BPDia2 : int 88 88 88 NA 74 50 36 68 68 68 ...
## $ BPSys3 : int 112 112 112 NA 116 88 106 118 118 118 ...
## $ BPDia3 : int 82 82 82 NA 76 44 38 60 60 60 ...
## $ Testosterone : num NA NA NA NA NA NA NA NA NA NA ...
## $ DirectChol : num 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
## $ TotChol : num 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
## $ UrineVol1 : int 352 352 352 NA 77 123 238 106 106 106 ...
## $ UrineFlow1 : num NA NA NA NA 0.094 ...
## $ UrineVol2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ UrineFlow2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ Diabetes : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ DiabetesAge : int NA NA NA NA NA NA NA NA NA NA ...
## $ HealthGen : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
## $ DaysPhysHlthBad : int 0 0 0 NA 0 NA NA 0 0 0 ...
## $ DaysMentHlthBad : int 15 15 15 NA 10 NA NA 3 3 3 ...
## $ LittleInterest : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
## $ Depressed : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ nPregnancies : int NA NA NA NA 2 NA NA 1 1 1 ...
## $ nBabies : int NA NA NA NA 2 NA NA NA NA NA ...
## $ Age1stBaby : int NA NA NA NA 27 NA NA NA NA NA ...
## $ SleepHrsNight : int 4 4 4 NA 8 NA NA 8 8 8 ...
## $ SleepTrouble : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PhysActive : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
## $ PhysActiveDays : int NA NA NA NA NA NA NA 5 5 5 ...
## $ TVHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ CompHrsDay : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
## $ TVHrsDayChild : int NA NA NA 4 NA 5 1 NA NA NA ...
## $ CompHrsDayChild : int NA NA NA 1 NA 0 6 NA NA NA ...
## $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AlcoholDay : int NA NA NA NA 2 NA NA 3 3 3 ...
## $ AlcoholYear : int 0 0 0 NA 20 NA NA 52 52 52 ...
## $ SmokeNow : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
## $ Smoke100 : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ Smoke100n : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SmokeAge : int 18 18 18 NA 38 NA NA NA NA NA ...
## $ Marijuana : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ AgeFirstMarij : int 17 17 17 NA 18 NA NA 13 13 13 ...
## $ RegularMarij : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
## $ AgeRegMarij : int NA NA NA NA NA NA NA NA NA NA ...
## $ HardDrugs : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
## $ SexEver : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
## $ SexAge : int 16 16 16 NA 12 NA NA 13 13 13 ...
## $ SexNumPartnLife : int 8 8 8 NA 10 NA NA 20 20 20 ...
## $ SexNumPartYear : int 1 1 1 NA 1 NA NA 0 0 0 ...
## $ SameSex : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
## $ SexOrientation : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
## $ PregnantNow : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
library(dplyr)
NHANES_ltd <- select(NHANES[sample(nrow(NHANES), 500),], ## subset for lighter-weight figures
                     Age, Gender, Education, HHIncomeMid, Height, BMI_WHO, SexAge, AgeFirstMarij) %>%
  mutate(Education = as.ordered(Education),
         BMI_WHO = as.ordered(BMI_WHO))
str(NHANES_ltd)
## Classes 'tbl_df', 'tbl' and 'data.frame': 500 obs. of 8 variables:
## $ Age : int 67 21 23 21 13 80 5 3 10 30 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 2 1 2 1 ...
## $ Education : Ord.factor w/ 5 levels "8th Grade"<"9 - 11th Grade"<..: 1 4 3 3 NA 3 NA NA NA 5 ...
## $ HHIncomeMid : int 17500 NA 100000 22500 7500 22500 12500 50000 70000 100000 ...
## $ Height : num 169 184 142 176 162 ...
## $ BMI_WHO : Ord.factor w/ 4 levels "12.0_18.5"<"18.5_to_24.9"<..: 2 2 3 2 2 2 1 1 2 4 ...
## $ SexAge : int 17 NA NA NA NA NA NA NA NA 18 ...
## $ AgeFirstMarij: int NA NA NA NA NA NA NA NA NA NA ...
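One possible answer to the "why set the seed?" question: `sample()` draws the 500 rows at random, so without a fixed seed the subset, and every figure built from it, would change on each run. A minimal base-R sketch of that behaviour follows; `n_rows` is a stand-in for `nrow(NHANES)` so the example runs without the package.

```r
## Stand-in for nrow(NHANES); 10000 matches the str() output shown earlier
n_rows <- 10000

## Same seed, same random draw
set.seed(123)
draw_a <- sample(n_rows, 500)
set.seed(123)
draw_b <- sample(n_rows, 500)
identical(draw_a, draw_b)  # TRUE

## Drawing again without resetting the seed gives a different subset
draw_c <- sample(n_rows, 500)
identical(draw_a, draw_c)
```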
Multivariate plots
Lower-variate plots
Powell, V. CSV Fingerprint. http://setosa.io/csv-fingerprint/
plot(NHANES_ltd)
select(NHANES_ltd, Age, Height, SexAge, AgeFirstMarij) %>% pairs()
The pairs plot is useful on its own, but the generalized pairs plot is even better.
Emerson, J. W., Green, W. A., Schloerke, B., Crowley, J., Cook, D., Hofmann, H., and Wickham, H. (2013). The generalized pairs plot. Journal of Computational and Graphical Statistics, 22(1):79–91.
library(ggplot2)
library(GGally)
print(select(NHANES_ltd, Age, Gender, Height, SexAge, AgeFirstMarij) %>% ggpairs())
Tennekes, M., de Jonge, E., and Daas, P. J. H. (2013). Visualizing and inspecting large datasets with tableplots. Journal of Data Science, 11(2013):43-58. http://bit.ly/tabplot
library(tabplot)
NHANES_ltd2 <- select(NHANES,
                      Age, Education, HHIncomeMid, Height, BMI_WHO, SexAge, AgeFirstMarij) %>%
  mutate(Education = as.ordered(Education),
         BMI_WHO = as.ordered(BMI_WHO))
tableplot(NHANES_ltd2, sortCol=Age)
tableplot(NHANES_ltd2, sortCol=BMI_WHO)
tableplot(NHANES_ltd2, sortCol=Education)
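As described by Tennekes et al., a tableplot sorts the rows by one column, cuts the sorted rows into equal-sized bins, and summarizes every variable within each bin. The sketch below imitates that idea in base R. The data frame `toy`, its columns, and the choice of 10 bins are invented for illustration, and this is not how `tabplot` is implemented internally.

```r
set.seed(1)
## Hypothetical toy data (not NHANES): one numeric and one categorical column
toy <- data.frame(
  age   = sample(1:80, 1000, replace = TRUE),
  group = factor(sample(c("a", "b", "c"), 1000, replace = TRUE))
)

n_bins <- 10

## Sort rows by the chosen column (decreasing order here)
sorted <- toy[order(toy$age, decreasing = TRUE), ]

## Assign each sorted row to one of n_bins equal-sized row bins
sorted$bin <- ceiling(seq_len(nrow(sorted)) * n_bins / nrow(sorted))

## Numeric column: one mean per bin (a tableplot draws these as bars)
bin_means <- tapply(sorted$age, sorted$bin, mean)

## Categorical column: category proportions per bin (drawn as stacked bars)
bin_props <- prop.table(table(sorted$bin, sorted$group), margin = 1)

round(bin_means, 1)
round(bin_props, 2)
```

Because the rows are sorted by `age`, the bin means fall steadily from the first bin to the last. A pattern in `bin_props` across bins would suggest an association between `group` and `age`.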
Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16(6).
Do college graduates become sexually active later than similar individuals with less than a college education?
Can you see the difference?
library(nullabor)
qplot(Education, SexAge, data=NHANES_ltd) %+% lineup(null_permute('SexAge'), NHANES_ltd) +
facet_wrap(~.sample) + geom_boxplot() + theme(axis.text.x = element_text(angle=90, vjust=0.5))
## decrypt("OlCE bQTQ Aw GWPATAWw vr")
decrypt("OlCE bQTQ Aw GWPATAWw vr")
## [1] "True data in position 20"
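The lineup idea from Wickham et al. hides the plot of the real data among plots of null data sets. With `null_permute('SexAge')`, each null comes from shuffling `SexAge`, which breaks any association with `Education`. A rough base-R sketch of the same mechanism follows. It uses synthetic data and a hypothetical helper, `make_lineup()`, and it is not the `nullabor` implementation.

```r
set.seed(42)
## Hypothetical toy data: a group label and a response that depends on it
toy <- data.frame(
  group = rep(c("low", "high"), each = 50),
  y     = c(rnorm(50, mean = 17), rnorm(50, mean = 19))
)

## Hypothetical helper: n panels, one holding the true data and
## n - 1 holding copies in which `var` has been permuted
make_lineup <- function(data, var, n = 20) {
  true_pos <- sample(n, 1)
  panels <- lapply(seq_len(n), function(i) {
    d <- data
    if (i != true_pos) d[[var]] <- sample(d[[var]])  # permutation null
    d$.sample <- i
    d
  })
  list(data = do.call(rbind, panels), true_pos = true_pos)
}

lu <- make_lineup(toy, "y")

## Numeric stand-in for "can you see the difference?":
## gap between group means in each panel
gap <- sapply(split(lu$data, lu$data$.sample), function(d) {
  mean(d$y[d$group == "high"]) - mean(d$y[d$group == "low"])
})

unname(which.max(abs(gap)))  # panel that stands out most
lu$true_pos                  # panel that holds the real data
```

If a viewer, or here a simple summary statistic, picks the true panel out of 20, that is evidence against the null hypothesis of no association, since a random guess would succeed only 1 time in 20.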